Inaccurate cost estimates have substantial effects on the final cost
of construction projects and erode profits. Cost estimation at the
conceptual phase is a challenge because only limited information is
available. Approaches for cost estimation have therefore been
explored thoroughly; however, they are not employed extensively in
practice. The main goal of this paper is to compare the performance
of various models in predicting the cost of construction projects at
the early conceptual phase of project development. In this study, on
the basis of actual project data, three modelling algorithms, namely
random forest, support vector machine and artificial neural network,
are used to forecast the construction cost of Ethiopian highway
projects. The three models were then compared based on their
prediction outcomes and root mean square error. The findings
revealed that random forest outperforms the neural network and the
support vector machine in prediction accuracy. Based on root mean
square error, the random forest cost model provides 18.8% and 23.4%
more accurate results than the neural network and support vector
machine models, respectively. It is anticipated that a more reliable
cost estimation model could be designed in the early project phases
by using a random forest regression technique. In conclusion, the
findings can help practitioners in the highway construction industry
in Ethiopia make sound financial decisions at the early phases of
project development.


1. Introduction


In the construction practice of developing and transition countries, failure to complete projects
on time is a well-known problem [1]. The success of a construction project is evaluated by whether
it meets the budget, schedule, and quality of work expected by the client. Accurate cost
estimation in the preliminary stage of a project is essential for decision-makers to control the
overall project [2]. In addition, the importance of early estimation from the viewpoint of owners
and related project teams cannot be over-emphasized [3]. Adequate estimation of construction cost
is a key factor in any type of construction project. However, forecasting the cost of construction
projects is a challenging task [4]. Moreover, Ma et al. [5] stated that construction cost
estimation, which is normally labor-intensive and error-prone, is one of the most important
tasks for the multiple participants in a project’s life cycle. Previous studies have shown that
combining predictive analytics with historical data can improve cost estimation in construction
projects. However, accurately estimating the cost of projects at the conceptual phase remains a
challenge [6-8].


To address the aforementioned problems and estimate construction project cost more accurately
and rapidly, this study puts forward a cost estimation method based on machine learning
algorithms. The study presents our work on predicting the cost of highway construction projects
from a small number of project features (attributes). This is a typical regression problem: the
aim is to predict the cost of a highway project given its features. The motivation for this
investigation is to provide all contracting parties with accurate information about the expected
cost of a highway project at its early phase, with minimal error. After the different models are
built, each is evaluated by comparing its accuracy.


2. Literature Review 


Various estimation techniques and methods are available. With improvements in computing
capability, recent cost estimating techniques tend to use more complex approaches and larger
data sets. Machine learning algorithms, as part of artificial intelligence, allow multi-linear
and non-linear relationships between variables and final costs to be explored, and have been
employed in recent years [9-11]. In particular, numerous applications of support vector machines,
artificial neural networks and random forest regression in various fields of civil engineering
have been reported for prediction as well as optimization problems [12-14]. In this section, the
existing literature on regression problems in the construction domain is reviewed.


Aiming to minimize the prediction error in conceptual estimates, Dursun and Stoy [15] adopted a
multistep ahead (MSA) approach, which relies on the idea of using several cascaded estimations to
predict future values. Based on test outcomes obtained from 657 building projects, the MSA
approach significantly outperformed linear regression (LR) and artificial neural network (ANN)
techniques in prediction accuracy. Petruseva et al. [16] predicted the bidding price in
construction using a support vector machine (SVM). Yousefi et al. [17] proposed an ANN model to
forecast cost and time claims in Iranian construction projects. Magdum and Adamuthe [2] presented
construction cost prediction models using LR and ANN, and the results revealed that ANN gives
better prediction accuracy than the statistical regression method. The accuracy of LR and SVM
models in forecasting construction costs was compared in [18]. Peško et al. [19] estimated the
durations and costs of urban road construction using ANN and SVM. Shin [20] also developed an
SVM model to predict construction safety and health management cost. In recent times, random
forest (RF) has been applied to various real-world regression and optimization problems [21-24].
Random forest has also been applied in the construction domain [25,26]. Kang and Ryu [14]
predicted types of occupational accidents at construction sites using a random forest model. In
summary, estimation of highway construction cost using RF, ANN and SVM is not present in the
literature. This study therefore aims to compare the performance of the three models in
forecasting the cost of highway projects.


3. Methodology 


To deal with the regression problem described in this study, k-fold cross-validation and the
Root Mean Squared Error (RMSE) metric are employed, respectively, to run and validate the
modelling process and to make a comparative assessment. RMSE is the most important criterion
for fit if the main purpose of the model is prediction [6]. In particular, the methodology
followed in this study is: (a) each model is trained using 3-fold cross-validation, and (b) the
final RMSE is computed as the average of the results over all folds. The three modelling
algorithms (SVM, ANN and RF) are implemented in Python using the Scikit-Learn library. All
study outcomes are prepared and presented using various Python programming packages.
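
The two-step procedure (3-fold cross-validation, then averaging the per-fold RMSE) can be sketched without any library. This is only an illustration under stated assumptions: the fold assignment, the toy values and the `fit_mean_model` placeholder are hypothetical and merely stand in for the SVM, ANN and RF models that the study builds with Scikit-Learn.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; case i is held out in fold i % k."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


def fit_mean_model(y_train):
    """Hypothetical placeholder model: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda n_cases: [mean] * n_cases


def cross_validated_rmse(y, k=3):
    """Return the average held-out RMSE over the k folds and the fold scores."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        predict = fit_mean_model([y[i] for i in train_idx])
        y_test = [y[i] for i in test_idx]
        fold_scores.append(rmse(y_test, predict(len(y_test))))
    return sum(fold_scores) / len(fold_scores), fold_scores


# Toy log-cost values (illustrative only, not the study's data).
toy_log_costs = [18.2, 19.5, 20.1, 17.9, 20.8, 19.0, 18.7, 20.3, 19.9]
average_rmse, per_fold = cross_validated_rmse(toy_log_costs, k=3)
print(f"per-fold RMSE: {per_fold}")
print(f"average RMSE:  {average_rmse:.4f}")
```

Replacing `fit_mean_model` with a real regressor leaves the splitting and averaging logic unchanged.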


4. Variables Analysis 


4.1. Data Collection and Description 


The historical project data were compiled by the authors from the Ethiopian Road Authority (ERA)
management system software. Highway projects that were started and completed between 2006 and
2018 are considered in developing the historical database. The project cost data set has 8
variables (features) for predicting the cost, of which 4 are numerical and 4 are categorical.
The variables are project length, number of bridges, inflation rate, project scope, terrain
type, project type, contract duration and project location. There are 74 project cases
considered in this study. The project data were recorded and the model dataset was compiled
based on the above-mentioned input variables. The description of the dataset is
summarized in Fig. 1.


       Project_Length  No_of_Bridges  Inflation_Rate  Project_Scope
count       74.000000      74.000000       74.000000      74.000000
mean        68.905405       4.391892       15.202135       1.527027
std         35.569407       5.411347        4.560355       0.831288
min         10.000000       0.000000        8.225000       1.000000
25%         44.250000       0.000000       12.425000       1.000000
50%         63.500000       2.500000       16.180000       1.000000
75%         91.000000       6.750000       18.200000       2.000000
max        180.000000      21.000000       25.250000       3.000000

       Terrain_Type  Project_Type  Contract_Duration  Project_Location
count     74.000000     74.000000          74.000000         74.000000
mean       1.762859      2.171723         951.094595          2.608108
std        0.511248      1.104458         236.731058          1.488012
min        1.000000      1.000000          90.000000          1.000000
25%        1.465775      1.000000         910.000000          1.000000
50%        1.679250      2.000000        1065.000000          2.000000
75%        2.000000      2.170625        1095.000000          4.000000
max        2.700000      4.000000        1280.000000          5.000000

       Project_Cost
count  7.400000e+01
mean   5.187602e+08
std    3.727179e+08
min    7.011855e+05
25%    1.899287e+08
50%    4.903572e+08
75%    8.078543e+08
max    1.533714e+09

Fig. 1. Description of project dataset.


4.2. Data Preprocessing 


Data preprocessing transforms the raw data into outputs that can be used directly as inputs for
data modelling. At this stage, the categorical variables are converted to numerical variables
for the sake of model simplicity. In addition, a correlation matrix is generated to investigate
possible relationships among the model input variables and so minimize their likely impact on
the prediction outcome.


4.3. Converting categorical variables to numerical variables


Several machine learning modelling algorithms need numbers as inputs, so categorical variables
must be coded as numbers in some way. Accordingly, the four categorical features are coded with
numbers before the modelling process starts.
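
A minimal sketch of such integer coding follows. The category labels are hypothetical, since the paper does not list the category names, and assigning codes in sorted order is an assumption made only for illustration.

```python
def build_codebook(values):
    """Assign integer codes 1..m to the distinct labels, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)), start=1)}


def encode(values, codebook):
    """Replace each label with its integer code."""
    return [codebook[label] for label in values]


# Hypothetical terrain labels; the actual categories are not given in the text.
terrain = ["flat", "rolling", "mountainous", "rolling", "flat"]
terrain_codebook = build_codebook(terrain)
terrain_codes = encode(terrain, terrain_codebook)
print(terrain_codebook)
print(terrain_codes)
```

Integer codes impose an ordering on the categories; whether that ordering matters depends on the model being trained.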


4.4. Correlation matrix or heatmap 


A correlation coefficient gives the degree of association between two variables [27]. It is
important to explore possible correlations between the dependent and the independent variables
in modelling to better understand the data set. Linear regression models are sensitive to
outliers, non-linearity and collinearity [6]; hence we check for these likely problems. For the
8 variables, Fig. 2 depicts the correlation between every pair of variables and between each
variable and project cost (the dependent variable). In this figure, there are no highly
intercorrelated variables. Hence, we keep all of these variables when selecting and preparing
the features to use in the modelling. On the other hand, the project type and contract duration
variables have a slightly higher correlation with project cost than the other variables
displayed in Fig. 2.


[Fig. 2: correlation heatmap of Project_Length, No_of_Bridges, Inflation_Rate, Project_Scope,
Terrain_Type, Project_Type, Contract_Duration and Project_Location; cell values not recoverable.]


Fig. 2. Correlation of input variables with project cost.
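
The pairwise values behind a heatmap such as Fig. 2 can be sketched as follows. The sketch assumes the Pearson coefficient, which the text does not name explicitly, and the column values are toy numbers rather than the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (illustrative values only, not the study's data).
toy = {
    "Project_Length": [10.0, 44.0, 63.0, 91.0, 180.0],
    "No_of_Bridges": [0.0, 0.0, 2.0, 7.0, 21.0],
    "Log_Project_Cost": [14.1, 18.9, 19.8, 20.4, 21.0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```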


4.5. Log transformation of the dependent variable 


As recommended by specialists, a log transformation is applied to the dependent variable, i.e.
project cost. A log transform can make skewed data more symmetric. In relative terms, it moves
small values farther apart and large values closer together (see Fig. 3). This is the most
important feature of the log transformation [28]. Moreover, it is easier to describe the
relationship between variables when it is approximately linear. In general, the transformation
works well for modelling project cost data that range up to very large amounts.


[Fig. 3: two histograms of Project_Cost, one on the original scale (horizontal axis scaled by
1e9) and one after log transformation.]


Fig. 3. Histogram of project cost with and without log transformation.
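
The compressing effect of the transformation can be illustrated with the minimum and maximum Project_Cost values from Fig. 1. The natural logarithm is an assumption here, since the text does not state the base, and `back_transform` is a hypothetical helper name.

```python
import math

# Minimum and maximum Project_Cost taken from the summary in Fig. 1.
cost_min = 7.011855e05
cost_max = 1.533714e09

# Natural logarithm is an assumption; the text does not state the base.
log_min = math.log(cost_min)
log_max = math.log(cost_max)

print(f"raw scale: largest cost is {cost_max / cost_min:,.0f} times the smallest")
print(f"log scale: values run from {log_min:.2f} to {log_max:.2f}")


def back_transform(log_cost):
    """Map a value on the log scale back to the original cost units."""
    return math.exp(log_cost)
```

A model trained on the log scale returns predictions on that scale, so a step like `back_transform` is needed before reporting costs in currency units.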


4.6. Feature scaling 


Standardizing the data is another important step in the preprocessing phase because it is
helpful for all the models. The scaling was done independently for the training and the testing
sets, as cross-validation is employed in this study.
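
A sketch of z-score standardization within one cross-validation fold is given below. It assumes that the mean and standard deviation are estimated on the training fold and reused on the held-out fold, which is a common convention; the text only says that the two sets were scaled independently, so the study may instead have computed separate statistics for each set. The function names and duration values are hypothetical.

```python
import math


def fit_scaler(values):
    """Return the mean and population standard deviation of a column."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def transform(values, mean, std):
    """Standardize values to z-scores using the supplied statistics."""
    return [(v - mean) / std for v in values]


# Toy contract durations in days (illustrative only).
train_duration = [90.0, 910.0, 1065.0, 1095.0]
test_duration = [1280.0, 730.0]

# Assumption: statistics come from the training fold only.
mean, std = fit_scaler(train_duration)
train_scaled = transform(train_duration, mean, std)
test_scaled = transform(test_duration, mean, std)
print(train_scaled)
print(test_scaled)
```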


5. Performance evaluation of models 


This section presents the evaluation of the different Scikit-Learn modelling algorithms. The
final step is to evaluate the performance of each algorithm, which is particularly important
for comparing how well different algorithms perform on a given dataset. In this study, the RMSE
evaluation metric, which is the square root of the mean of the squared errors [28], is used
because it amplifies and severely penalizes large errors [10]. Its equation is written as follows:


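$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \qquad (1)$$


where $y_i$ stands for log(Project_Cost_i), $\hat{y}_i$ stands for log(predicted Project_Cost_i), and $n$ is the number of cases evaluated.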
Fortunately, it is not necessary to perform this calculation manually: the Scikit-Learn library
comes with pre-built functions that can be used to compute the RMSE value. Finally, the RMSE
results are used for the comparative analysis, as they show the prediction accuracy of all the
models.
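
For reference, Eq. (1) applied to log-transformed costs can be written out by hand as below. This is a library-free sketch rather than the Scikit-Learn call used in the study; the natural logarithm, the function name `rmse_log` and the toy costs are assumptions.

```python
import math


def rmse_log(actual_costs, predicted_costs):
    """RMSE between log(actual) and log(predicted), following Eq. (1)."""
    errors = [
        (math.log(a) - math.log(p)) ** 2
        for a, p in zip(actual_costs, predicted_costs)
    ]
    return math.sqrt(sum(errors) / len(errors))


# Toy costs (illustrative only): each prediction is off by a factor of e,
# so every error on the natural-log scale has magnitude 1.
actual = [1.0e8, 5.0e8, 1.0e9]
predicted = [c * math.e for c in actual]
print(rmse_log(actual, predicted))
```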


6. Results and discussions


The ANN, SVM and RF models were developed and evaluated on the project dataset to predict the
final cost of highway projects. The performance of the three models was evaluated using the
model accuracy measures tabulated in Table 1.


Table 1. Model performance results 


Models       Average RMSE
SVM model    1.2569
ANN model    1.1802
RF model     0.9579


Of the three models, RF gives the most accurate prediction on this cost data set, with the
smallest error value. Conversely, SVM gives the worst result, behind ANN, since lower RMSE
values indicate a better fit. Based on the RMSE values, the RF cost model provides 18.8% and
23.4% more accurate results than the ANN and SVM models, respectively.
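
One way to reproduce such percentages from Table 1 is to treat them as relative RMSE reductions with the competing model as the baseline. The paper does not state the formula, so this is an assumption. Under it, the ANN comparison comes out at about 18.8%, while the SVM comparison comes out at about 23.8%, slightly above the reported 23.4%.

```python
# Average RMSE values from Table 1.
rmse_values = {"SVM": 1.2569, "ANN": 1.1802, "RF": 0.9579}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


for name in ("ANN", "SVM"):
    gain = relative_reduction(rmse_values[name], rmse_values["RF"])
    print(f"RF vs {name}: {gain:.1f}% lower RMSE")
```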


[Fig. 4: plot of RMSE Values (vertical axis) against Prediction Models (horizontal axis).]


Fig. 4. Comparison of model performances based on RMSE.


The RF model predicted the cost of highway projects with an RMSE value of 0.96, i.e., the
difference between the predicted and actual cost values was insignificant. Fig. 5 shows that
the predicted project cost values agree closely with the actual collected cost values. This
supports the finding that the RF model generated more accurate cost predictions than the ANN
and SVM models.


[Fig. 5: plot of actual and RF-predicted values; horizontal axis: Project Cases; vertical axis:
Log Transformed Cost.]


Fig. 5. Comparison of RF actual and predicted values.


7. Conclusions


The main objective of this study was to develop models for predicting the cost of highway
projects and to make a comparative assessment of their accuracy using RMSE results. All the
necessary computations, including model development, were performed using different
Scikit-Learn library packages in Python. In this study, SVM, ANN and RF algorithms were
employed to forecast highway project costs. The results clearly revealed that RF is more
accurate in prediction, with a lower error value, than ANN and SVM. The predictions made with
RF show a stronger degree of agreement with the actual collected cost data of highway projects
than those of ANN and SVM. This study will therefore be helpful to the contracting parties in
the highway construction industry and to future work. A mobile app or simple desktop package
could be created by storing the predicted data in a database, so that the contracting parties
would have concise information and could invest more safely in the proposed project.